The gene encoding the spike glycoprotein of the human
coronavirus HCV 229E has been cloned and sequenced.
This analysis predicts an S polypeptide of 1173 amino
acids with an Mr of 128600. The polypeptide has 30
potential N-glycosylation sites. A number of structural
features typical of coronavirus S proteins can be
recognized, including a signal sequence, a membrane
anchor, heptad repeat structures and a carboxy-terminal
cysteine cluster. A detailed, computer-aided comparison
with the S proteins of infectious bronchitis virus, feline
infectious peritonitis virus, transmissible gastroenteritis
virus and murine hepatitis virus, strain JHM, is presented.
We have also done a Northern blot analysis of viral RNAs
in HCV 229E-infected cells using synthetic
oligonucleotides. On the basis of this analysis, and by
analogy to the replication strategy of other
coronaviruses, we are able to propose a model for the
organization and expression of the HCV 229E genome.


Introduction 


Human coronaviruses (HCV) are a common cause of 
respiratory disease in man and it has been estimated that 
they are responsible for up to 20% of common colds 
(Hierholzer & Tannock, 1988; Isaacs et al., 1983; 
McIntosh et al., 1974). With a few exceptions, HCVs are 
difficult to propagate in tissue or organ culture and 
consequently their biology is relatively poorly
understood. Nevertheless, it has been possible to establish that
there are two major HCV antigenic groups, represented 
by HCV 229E and HCV OC43 (Macnaughton, 1981; 
Pedersen et al., 1978). 

The HCV 229E virion consists of the genomic RNA,
which, if HCV is similar to other coronaviruses, will be
about 30 kb, a lipid envelope and three major proteins:
the nucleocapsid protein, N (Mr of 50K), the membrane
glycoprotein, M (Mr of 21K to 25K) and the spike
glycoprotein, S (Mr of 186K) (Kemp et al., 1984;
Macnaughton & Madge, 1978; Schmidt & Kenny, 1982).
Human coronaviruses of the OC43 group possess an
additional surface glycoprotein, the
haemagglutinin-esterase, HE (Mr of 65K) (Hogue & Brian, 1986).

The HCV replication strategy involves the synthesis of 
subgenomic RNAs in the cytoplasm of infected cells 
(Weiss & Leibowitz, 1981). It is assumed that these 
subgenomic RNAs are synthesized by a process of 
leader-primed discontinuous transcription as has been 
described for the murine hepatitis virus (MHV) (Baric et 
al., 1985; Makino et al., 1986; Shieh et al., 1987). This
process involves the recognition of a specific sequence,
the so-called ‘region of homology’, present at the 3’ end of 
a leader RNA and at each intergenic transcriptional 
reinitiation site on the antigenomic RNA template (for a 
review see Lai et al., 1987). 

In the case of HCV this process results in a set of six 3’ 
coterminal subgenomic RNAs (Kamahora et al., 1989; 
Schreiber et al., 1989). By analogy to other coronaviruses,
the 5’ unique region in each RNA (i.e. the region not
present in the next smallest RNA) should be translated
and, at least in the case of the RNAs encoding structural
proteins, it should be expressed as a single polypeptide
(Spaan et al., 1988).

Recently, the HCV 229E genes encoding the N protein 
and the M glycoprotein have been cloned and sequenced 
(Raabe & Siddell, 1989a; Schreiber et al., 1989). Also, 
sequence analysis of the genomic region upstream from 
the M protein gene has revealed three open reading 
frames (ORFs) with the potential to encode polypeptides 
of 15.3K, 10.2K and 9.1K (Raabe & Siddell, 1989b). As
proteins of this size have not been identified in virions
(Schmidt & Kenny, 1982), these genes are thought to
encode non-structural components. A similar arrangement
of structural and non-structural genes has been
shown for a number of other coronaviruses (Spaan et al., 
1988). 

The spike glycoprotein of coronaviruses forms the 
characteristic peplomer structures on the surface of the 
virion. The protein is a large, acylated glycopolypeptide 
with an Mr, depending upon the virus in question, of
between 170K and 200K (Spaan et al., 1988). Each
peplomer consists of a dimer or trimer of S proteins
(Cavanagh, 1983) which, in the case of MHV and
infectious bronchitis virus (IBV), but not feline
infectious peritonitis virus (FIPV) or transmissible
gastroenteritis virus (TGEV), have been cleaved into two
non-identical subunits, the amino-terminal S1 and the
carboxy-terminal S2.

In the case of HCV 229E it has been shown that the S 
protein is the major antigenic determinant in natural 
infections and has a central role in the induction of the 
immune response (Macnaughton et al., 1981). Studies on 
other coronaviruses have shown that the same protein 
also mediates such essential biological functions as 
attachment of the virion to the cell surface and the fusion 
of viral and cellular membranes (de Groot et al., 1989; 
Sturman & Holmes, 1985). 

In the long term, our aim is to define the role of the S 
protein in the pathogenesis of HCV 229E infections as 
well as its interaction with the human immune system. 
As a first step, we present the complete nucleotide 
sequence of the HCV 229E S gene and compare the 
predicted amino acid sequence with other recently 
determined coronavirus S protein sequences. Also, on 
the basis of analogy to other coronaviruses, recently 
published sequence data (Raabe & Siddell, 1989a, b; 
Schreiber et al., 1989) and a Northern blot analysis of
intracellular viral RNA, we propose a model for the 
organization and expression of the HCV 229E genome. 


Methods 


Virus and cells. The HCV 229E strain used in these studies was 
isolated from a volunteer at the MRC Common Cold Unit, Salisbury, 
U.K. The virus was adapted to tissue culture by passage in C16 cells, a 
heteroploid cell line of human origin (Phillpotts, 1983). The virus was
titrated by limiting dilution and the supernatant from a well with one
focus of infection was taken as the primary virus stock. C16 cells were
infected with HCV 229E at an m.o.i. of 3, incubated at 33°C, and
cytoplasmic RNA was isolated 48 h p.i. using standard procedures
(Siddell, 1983). Polyadenylated RNA was fractionated by
chromatography on poly(U)-Sepharose.


cDNA cloning. Two cDNA libraries were prepared essentially
according to the method of Gubler & Hoffman (1983), using either
random hexanucleotides or an S gene-specific oligonucleotide
(positions 227 to 244, Fig. 1) as first-strand primer. The synthesized ds
cDNA was size-fractionated on a Sephacryl S-1000 column, ligated to
EcoRI linkers and cloned into the Bluescript vector pKS II+
(Stratagene). Recombinant clones were screened by colony
hybridization with HCV 229E-specific oligonucleotides. Plasmid purification,
agarose gel electrophoresis, colony hybridizations and standard
recombinant DNA procedures were done as described by Maniatis et
al. (1982).


Sequence analysis. cDNA was subcloned by digestion with restriction
enzymes and ligation into SmaI-linearized M13mp19 vector DNA
(Messing & Vieira, 1982). The sequence of clone 11B5 was obtained
after generation of a series of overlapping deletions using exonuclease
III (Henikoff, 1984). Sequencing was done on ds and ss DNA templates
using the chain termination method (Sanger et al., 1977) with the M13
universal primer or S gene-specific oligonucleotide primers. The
sequences presented were determined completely on both cDNA
strands. Sequence data were assembled by the programs of Staden
(1982) and analysed by the programs of the University of Wisconsin
Computer Genetics Group (Devereux et al., 1984).


Northern blot analysis. Polyadenylated RNA from HCV 229E-infected
C16 cells was electrophoresed on 0.9% agarose-formaldehyde
gels and transferred onto nitrocellulose membranes using standard
procedures (Maniatis et al., 1982). HCV 229E-specific oligonucleotides
were synthesized using phosphoramidite chemistry on a Cyclone DNA
synthesizer and purified by gel electrophoresis. Oligonucleotides were
5’ end-labelled with [γ-32P]ATP and hybridized using the conditions
described by Woods (1984). The oligonucleotides used were
5’ GCAACCACCGGGTATATC 3’ (A), 5’ AACATCAGTCTGCAATGC 3’ (B),
5’ GAGCCATTACTGTATGTG 3’ (C), 5’ CGAATGGTTTCAGAGCCT 3’ (D),
5’ CAACAGCTGGGTGTTCAC 3’ (E), 5’ ATACACACTAGTAGTATC 3’ (F) and
5’ TCCCAATTAGCCCAGGTG 3’ (G).
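Because the probes are complementary to the viral RNAs, the mRNA-sense stretch that each one anneals to is its reverse complement. The following is a minimal Python sketch of that relationship; the function and dictionary names are illustrative assumptions, and only the seven probe sequences are taken from the list above.

```python
# Sketch: derive the mRNA-sense (DNA alphabet) target of each probe.
# Helper names are hypothetical; the sequences are those listed in the Methods.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Probe sequences, written 5' to 3'.
OLIGOS = {
    "A": "GCAACCACCGGGTATATC",
    "B": "AACATCAGTCTGCAATGC",
    "C": "GAGCCATTACTGTATGTG",
    "D": "CGAATGGTTTCAGAGCCT",
    "E": "CAACAGCTGGGTGTTCAC",
    "F": "ATACACACTAGTAGTATC",
    "G": "TCCCAATTAGCCCAGGTG",
}

# mRNA-sense stretch that each probe would anneal to.
TARGETS = {name: reverse_complement(seq) for name, seq in OLIGOS.items()}

print(TARGETS["A"])  # GATATACCCGGTGGTTGC
print(TARGETS["B"])  # GCATTGCAGACTGATGTT
```

Searching a sense-strand sequence for these target strings is one simple way to check where a probe is expected to bind.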

A cDNA probe specific for the HCV 229E N gene was prepared by 
nick translation of plasmid pSMF1 DNA (Myint et al., 1989) and
hybridized under standard conditions (Maniatis et al., 1982).


Results 
Characterization of HCV 229E-specific cDNA clones 


An 18 base oligonucleotide complementary to a sequence
near the 5’ end of the HCV 229E N gene was used to
screen a randomly primed cDNA library derived from
polyadenylated RNA extracted from HCV 229E-infected
cells (Raabe & Siddell, 1989a). Plasmid 2F7 contained
a 4.2 kb cDNA insert which hybridized to all HCV 229E
RNAs (data not shown). Sequence analysis of clone 2F7
showed that the insert cDNA extends from a position
within the N gene to a position within the S gene (see Fig.
2). An oligonucleotide complementary to the 5’ end of
clone 2F7 (using the mRNA orientation, nucleotides
1209 to 1226, Fig. 1) was used to identify clone 11B5,
which overlaps 2F7 by 3 kb and extends a further 1 kb in
the 5’ direction. Finally, a second series of cDNA clones
was synthesized using an oligonucleotide primer based
upon sequences derived from the 5’ end of clone 11B5
(nucleotides 227 to 244, Fig. 1). One such clone, 5B5,
encompasses the 5’ end of the S gene and extends
approximately 2.5 kb in the 5’ direction. Another, 8E10,
contains a 250 bp insert and terminates at the 5’ end with
a sequence previously identified as the HCV 229E leader
RNA (Schreiber et al., 1989). Fig. 2 shows the location of
these cDNA clones with respect to the genomic and
subgenomic RNAs.


Sequence analysis of the HCV 229E S protein gene 


The nucleotide sequence of the HCV 229E S gene
together with the predicted amino acid sequence of the S
protein are shown in Fig. 1. Immediately upstream of the
S gene ORF is a sequence, TCTCAACT, which is
similar or identical to sequences found adjacent to the
HCV 229E N and M genes, as well as the putative
non-structural genes 4a and 5 (see Fig. 2) (Raabe & Siddell,
1989a, b; Schreiber et al., 1989). This sequence
represents the HCV 229E ‘region of homology’ and the
divergence between the clones 5B5 and 8E10 at this point
(Fig. 1) confirms this as the site at which fusion of the
leader and body sequences of the HCV 229E S protein
mRNA has taken place.

Fig. 1. Nucleotide sequence of the HCV 229E S gene and the predicted amino acid sequence of the S protein precursor. The
amino-terminal signal sequence, the putative membrane anchor and the heptad repeat region are underlined. Potential
N-glycosylation sites and the cysteine-rich region are indicated. The region of homology preceding the S gene and the 4a gene are
overlined. The positions of the 4a initiation codon and the 5’ upstream ORF termination codon are boxed.

The AUG codon which initiates the S protein gene 
(nucleotide 39, Fig. 1) is in a favoured context (Kozak, 
1983) and opens a reading frame of 3519 nucleotides 
which encodes a polypeptide of 1173 amino acids with an 
Mr of 128.6K. The predicted S protein polypeptide
contains 30 potential N-glycosylation sites (NXS or
NXT) and the difference between the apparent Mr of the HCV
229E S protein (186K; Schmidt & Kenny, 1982) and the
predicted size of the polypeptide suggests that the
majority of these sites are used.
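The site count and the size comparison above can be reproduced with a few lines of Python. This is a minimal sketch, not the procedure used in this study: the motif is taken literally as N-X-S or N-X-T with any residue at X, the peptide is a made-up example rather than part of the S protein, and all names are hypothetical.

```python
import re

# N-X-S or N-X-T as defined in the text. The lookahead keeps the match
# one residue wide, so overlapping motifs (e.g. N-N-S-S) are all counted.
SEQUON = re.compile(r"N(?=.[ST])")


def find_sequons(protein: str) -> list[int]:
    """Return the 1-based position of the Asn in every N-X-S/T motif."""
    return [m.start() + 1 for m in SEQUON.finditer(protein)]


toy_peptide = "MNGTANNSSQNAT"  # invented sequence, for illustration only
print(find_sequons(toy_peptide))  # [2, 6, 7, 11]

# Arithmetic behind the figures quoted in the text.
orf_nt = 3519                    # length of the S ORF in nucleotides
residues = orf_nt // 3           # 1173 amino acids
predicted_k = 128.6              # Mr of the unmodified polypeptide (K)
apparent_k = 186.0               # apparent Mr of the virion S protein (K)
n_sites = 30                     # potential N-glycosylation sites

difference_k = apparent_k - predicted_k
per_site_k = difference_k / n_sites
print(residues, round(difference_k, 1), round(per_site_k, 2))
```

The difference of about 57K corresponds to roughly 1.9K per site if all 30 sites were occupied, which is the basis for the statement that most of them are likely to be used.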

At the amino terminus of the polypeptide is a stretch of 
14 mainly hydrophobic amino acids followed by two 
amino acids with small uncharged side-chains, a feature 
typical of a signal peptidase recognition site (von Heijne, 
1984). At the carboxy terminus, between amino acids
1116 and 1138, a second strongly hydrophobic region can
be recognized which is believed to serve as the
transmembrane anchor (de Groot et al., 1987a). This region is
flanked on the amino-terminal side by the sequence
KWPWWVWL, which differs by only one amino acid
from the sequence KWPWYVWL which is conserved in
all coronavirus S protein genes sequenced to date. On the
carboxy-terminal side, the membrane anchor region is
flanked by an unusually high number of cysteine
residues. This feature has also been recognized in other
coronavirus S proteins (Rasschaert & Laude, 1987;
Schmidt et al., 1987) and it has been proposed that at
least some of these residues may be involved in the
acylation of the S protein which has been described for
MHV (Sturman et al., 1985; van Berlo et al., 1987). In the
HCV 229E S protein sequence it is also possible to
identify the ‘heptad repeat’ structures (corresponding to
amino acids 794 to 849, Fig. 1) which have been
proposed by de Groot et al. (1987a) and Rasschaert &
Laude (1987) to be essential elements in forming the
elongated structure of the S protein.
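Hydrophobic stretches such as the signal sequence and the membrane anchor are commonly located with a sliding-window hydropathy profile. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a cut-off of 1.6; these choices are assumptions made for illustration, since the text does not state which method was used, and the test sequence is invented.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 19) -> list[float]:
    """Mean hydropathy of every window of the given length."""
    scores = [KD[aa] for aa in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def hydrophobic_windows(protein: str, window: int = 19,
                        threshold: float = 1.6) -> list[int]:
    """1-based start positions of windows at or above the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i + 1 for i, value in enumerate(profile) if value >= threshold]


# Invented example: a 19-residue hydrophobic core between charged flanks.
toy = "KDESRNQTGE" + "LIVALLVIFAVLLIVALGV" + "RKDEQNSTGK"
profile = hydropathy_profile(toy)
peak_start = profile.index(max(profile)) + 1
print(peak_start)  # 11, the first residue of the hydrophobic core
```

Applied to a full-length S protein sequence, windows above the threshold near the carboxy terminus would be candidates for the membrane anchor.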

Finally, the predicted sequence of the HCV 229E S 
protein does not reveal any basic amino acid sequences 
related to the motifs RRXRR or RRAHR (where X is F, 
S, H or A) which have been identified as the sites at 
which MHV and IBV S proteins are proteolytically 
cleaved to yield the S1 and S2 polypeptides (Spaan et al.,
1988). These motifs are also absent in the FIPV and
TGEV S proteins, which are apparently not cleaved
(Garwes & Reynolds, 1981; Horzinek et al., 1982).
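A search for the two cleavage motifs named above can be expressed as a single regular expression. This is a minimal sketch with hypothetical names and invented peptides; it encodes only the motifs as written in the text (RRXRR with X being F, S, H or A, and RRAHR).

```python
import re

# RRXRR (X = F, S, H or A) or RRAHR, as given in the text.
CLEAVAGE_MOTIF = re.compile(r"RR[FSHA]RR|RRAHR")


def find_cleavage_motifs(protein: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched motif) for every hit."""
    return [(m.start() + 1, m.group()) for m in CLEAVAGE_MOTIF.finditer(protein)]


# Invented peptides, for illustration only.
print(find_cleavage_motifs("MKTARRFRRSVLG"))  # [(5, 'RRFRR')]
print(find_cleavage_motifs("AAARRAHRSAAA"))   # [(4, 'RRAHR')]
print(find_cleavage_motifs("MKTLLVRSRAG"))    # []
```

An empty result for a complete S protein sequence would correspond to the absence of these motifs reported here for HCV 229E.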

Fig. 2. A proposed model for the organization and expression of the HCV 229E genome. The coding regions for the structural proteins
(S, M, N) and the non-structural proteins (4a, 4b, 5) are shown in relation to the genomic and subgenomic RNAs. The black boxes at the
5’ end of the RNAs represent a common leader sequence which has been demonstrated for the S and N mRNAs (this paper; Schreiber et
al., 1989). The positions of the oligonucleotides A to G are indicated. Also shown are the positions and sequences of the homology
regions and the extent of the cDNA clones used in this study.


Genomic organization of HCV 229E


Together with the data presented in this report a 
continuous sequence of 6.7 kb at the 3’ end of the HCV
229E genome has been determined. Within this sequence
the regions encoding the S, M and N proteins have been
identified on the basis of the sizes of the ORFs and the
characteristics of the predicted polypeptides (Raabe &
Siddell, 1989a; Schreiber et al., 1989). In addition, three
large ORFs which are supposed to encode non-structural
proteins have been found between the S and M genes
(Raabe & Siddell, 1989b). This arrangement of ORFs
with respect to the genomic RNA is summarized in 
Fig. 2. 

In order to identify the subgenomic RNAs that code
for the S, M, N and non-structural proteins, we have
done Northern blot analysis using synthetic
oligonucleotides and a cDNA probe (Fig. 2 shows the localization of
these probes). The cDNA probe, pSMF1, which
encompasses the N protein gene plus the 3’ non-coding region,
detects seven virus-specific RNAs which have been
numbered 1 to 7 in order of decreasing size (Fig. 3). The
M gene-specific oligonucleotide G hybridized to
RNAs 1 to 6, but not RNA 7. The S gene-specific
oligonucleotide A (complementary to nucleotides 1209 to
1226, Fig. 1) hybridized only to RNAs 1 and 2. Assuming
that the HCV RNAs are arranged as a 3’ coterminal
nested set and that the 5’ unique regions are translated,
these results lead us to propose that RNAs 2, 6 and 7
encode the structural proteins S, M and N, respectively.

The oligonucleotide B, as well as all other
oligonucleotides except A, hybridized to an RNA which we and
others have termed RNA 3. This result was unexpected
because the sequences complementary to oligonucleotide
B lie within the S gene ORF (nucleotides 2379 to 2396,
Fig. 1). There are no sequences within the S gene ORF
which resemble a ‘region of homology’ and at the
moment we have no reason to suppose that this RNA
functions as an mRNA.

Fig. 3. Northern blot analysis of HCV 229E RNA. The polyadenylated
RNA of HCV 229E-infected C16 cells was electrophoresed in
formaldehyde-agarose gels and transferred to nitrocellulose
membranes. 32P-labelled oligonucleotides A to G (see Methods for the
sequence and Fig. 2 for the location) and a cDNA clone corresponding
to the HCV 229E N gene and 3’ non-coding region were used as
hybridization probes. The HCV 229E-specific RNAs were numbered
(1 to 7) according to their decreasing size.

We have previously described three ORFs in the
region between the S and M genes of HCV 229E (Raabe
& Siddell, 1989b). However, there are only two viral
RNAs (RNA 4 and RNA 5) whose unique regions
encompass this area. To assign these ORFs to the RNAs
we therefore did hybridizations with oligonucleotides
located at the 5’ end of ORF 4a (oligonucleotide C), at
the 5’ end of ORF 4b (oligonucleotide D), at the 3’ end of
ORF 4b (oligonucleotide E) and at the 5’ end of ORF 5
(oligonucleotide F) (Fig. 3). As expected, oligonucleotide
C hybridized to RNA 4 and oligonucleotide F to RNA 5.
Oligonucleotide D, which corresponds to the 5’ end of
ORF 4b, does not hybridize to RNA 5, but a clear positive
signal is obtained for RNA 5 using oligonucleotide E.
This indicates that the 5’ end of the RNA 5 body extends
well into, but not over, the complete coding region of
ORF 4b. In the light of these data we re-examined the
ORF 4b sequence and found a perfect ‘region of
homology’ motif, TCTCAACT, at a position 107
nucleotides downstream from the ORF 4b initiation
codon. This indicates, in contrast to our previous
suggestion (Raabe & Siddell, 1989a), that in functional
terms the ORFs 4a and 4b should be assigned to the
‘unique’ region of RNA 4 and the ORF 5 to the unique
region of RNA 5.

On the basis of the available sequence data, analogy to 
other coronaviruses and the hybridization experiments 
described here, we propose a model of the organization 
and expression of the HCV 229E genome as is shown in 
Fig. 2. 
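The logic behind these assignments can be stated compactly: in a 3’ coterminal nested set, a probe hybridizes to every RNA whose body begins upstream of the probe site. The Python sketch below encodes only the relative order of the RNA body start sites and probe sites described in the text. The ranks are ordinal placeholders rather than nucleotide coordinates, the position of the RNA 3 start is inferred solely from the hybridization pattern, the leader sequence is ignored, and all names are hypothetical.

```python
# Landmarks in 5'->3' order along the genome. Only the order matters;
# the list positions are not real nucleotide coordinates.
LANDMARKS = [
    "RNA1_start",  # genomic RNA
    "RNA2_start",  # body begins upstream of the S gene
    "oligo_A",     # within the S gene
    "RNA3_start",  # inferred from hybridization only (between A and B)
    "oligo_B",     # within the S gene, downstream of A
    "RNA4_start",  # upstream of ORF 4a
    "oligo_C",     # 5' end of ORF 4a
    "oligo_D",     # 5' end of ORF 4b
    "RNA5_start",  # within ORF 4b
    "oligo_E",     # 3' end of ORF 4b
    "oligo_F",     # 5' end of ORF 5
    "RNA6_start",  # upstream of the M gene
    "oligo_G",     # within the M gene
    "RNA7_start",  # upstream of the N gene
    "cDNA_N",      # N gene and 3' non-coding region
]
RANK = {name: rank for rank, name in enumerate(LANDMARKS)}


def hybridizing_rnas(probe: str) -> list[int]:
    """Return the RNAs (1 to 7) whose body starts upstream of the probe."""
    return [n for n in range(1, 8) if RANK[f"RNA{n}_start"] < RANK[probe]]


for probe in ("oligo_A", "oligo_D", "oligo_E", "oligo_G", "cDNA_N"):
    print(probe, hybridizing_rnas(probe))
```

Under these assumptions the sketch reproduces the reported pattern: oligonucleotide A detects only RNAs 1 and 2, D does not detect RNA 5 whereas E does, G detects RNAs 1 to 6, and the N gene probe detects all seven RNAs.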


Discussion 


As we have described above, inspection of the HCV
229E S protein sequence reveals a number of features
which are typical of coronavirus S proteins, for example,
the amino-terminal signal sequence, the carboxy-terminal
membrane anchor and the carboxy-terminal cysteine
cluster. In order to search for further structural features
whose conservation may indicate an important functional
role, we have made a computer-aided comparison
of the HCV 229E S protein sequence with the published
S protein sequences of FIPV (de Groot et al., 1987b),
TGEV (Jacobs et al., 1987), IBV (Binns et al., 1985) and
MHV JHM (Schmidt et al., 1987). Firstly, we made an
‘optimal’ alignment of all sequences using the UWGCG
GAP program. This alignment (which is available from
the authors upon request) set matches, i.e. identical
amino acids, equal to 1.5 and mismatches equal to lower
values based upon the evolutionary distance between the
amino acids as measured by Dayhoff and normalized by
Gribskov & Burgess (1986). These alignments were then
displayed using the program GAPSHOW, with a match
display threshold of 1.5, i.e. only identical amino acids
are displayed. Finally, we marked the positions of
potential N-glycosylation sites as well as cysteine
residues in all sequences. The result is shown in Fig. 4.
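Once two sequences have been aligned, percentage identity is a simple count over the aligned columns. The following Python sketch illustrates one common convention, in which columns containing a gap are excluded from the denominator; this is an assumption and may differ from the way the GAP program reports its figures. The function name is hypothetical, and the example uses the two eight-residue motifs quoted in the Results.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over gap-free aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns in which neither sequence has a gap.
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if a != gap and b != gap
    ]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# HCV 229E motif versus the motif conserved in other coronaviruses.
print(percent_identity("KWPWWVWL", "KWPWYVWL"))  # 87.5
```

Percentage similarity would be computed in the same way, except that a column also counts when the two residues score above a chosen threshold in the comparison matrix.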

A number of important conclusions can be reached. 
Firstly, it is evident that the similarity of the HCV 229E 
sequence to the FIPV and TGEV sequences is much 
greater than to the IBV or MHV sequences. Moreover, as 
has been previously noted (de Groot et al., 1987a), there is
more similarity in the carboxy-terminal halves of these
proteins than in the amino-terminal halves. These
similarities are summarized in Table 1.

Although the amino-terminal halves of the coronavirus
S proteins are less well conserved with respect to
length and amino acid composition, it is interesting to
note that the ‘optimal’ alignment of the HCV S protein
sequence to the FIPV or TGEV sequences results in a
large amino-terminal gap. Jacobs et al. (1987) have
reported a striking discontinuity in the levels of amino
acid homology within the FIPV and TGEV S proteins.
At the amino terminus (nucleotides 1 to 274) the mean
homology is 30%, whereas the remaining sequences are
94% homologous. These authors have suggested that this
observation could be explained by recombination
between coronaviruses and our analysis is consistent
with this interpretation.

Fig. 4. Structural comparison of the S protein of HCV 229E and the S proteins of FIPV, TGEV, IBV and MHV JHM. The figure shows
the positions of identical amino acids after optimal alignment of all sequences. The ‘gaps’ introduced for alignment are shown as
boxes. The positions of potential N-glycosylation sites, cysteine residues and the post-translational cleavage sites of the
IBV and MHV proteins are indicated. Details of the UWGCG programs GAP and GAPSHOW are given in the text. The FIPV,
TGEV, IBV and MHV JHM S gene sequences were determined by de Groot et al. (1987b), Jacobs et al. (1987), Binns et al. (1985) and
Schmidt et al. (1987).


Table 1. Sequence comparison* of the S polypeptide of
HCV 229E and the S polypeptides of FIPV, TGEV, IBV
and MHV JHM

                        HCV 229E 1-543           HCV 229E 544-1173
Viral S                 Identity   Similarity    Identity   Similarity
protein    Residues     (%)        (%)           (%)        (%)
FIPV       1-786        38†        56            -          -
FIPV       787-1452     -          -             57         75
TGEV       1-781        37         56            -          -
TGEV       782-1447     -          -             57         74
IBV        1-535        18         38            -          -
IBV        536-1163     -          -             36         53
MHV JHM    1-626        16         36            -          -
MHV JHM    627-1235     -          -             32         54

* The sequences were aligned using the UWGCG GAP program.
† The figures given are the percentage amino acid identity or
similarity following optimal alignment.

It is worth noting that although the similarity between 
the HCV 229E and FIPV S proteins in positions 1 to 543
and 1 to 786, respectively, is only 38%, roughly 50% of
the cysteine residues in this region of both sequences are
located at the ‘same’ position. For the corresponding
region of the HCV and MHV proteins (positions 1 to 543
and 1 to 580, respectively) only about 17% of the cysteine
residues show this relationship. 

Within the carboxy-terminal half of all S proteins 
there is an evident clustering of N-glycosylation sites at a
position where the polypeptide is thought to emerge to
the outside of the lipid bilayer (de Groot et al., 1987a).
Also, in addition to the carboxy-terminal cysteine
cluster, we have now identified a number of cysteine
residues that are conserved within the carboxy-terminal
half of all S proteins. Striking, for example, are the
residues corresponding to the positions 608, 613, 619,
630, 715, 726, 917, 928 and 967 in the HCV sequence. It is 
clear that the relevance of features such as these will be 
fully appreciated only when a three-dimensional image 
of the S protein becomes available. 

The number and sizes of the HCV 229E RNAs 
identified in our Northern blot analysis are in agreement 
with previously published results (Schreiber et al., 1989;
Weiss & Leibowitz, 1981). By analogy to other
coronaviruses and on the basis of new hybridization data, we have
now proposed coding assignments for five of these 
RNAs (Fig. 2). These assignments and the mRNA 
function of the RNAs need to be confirmed by in vitro 
translation of purified or synthetic RNAs, together with 
identification of the translation products using HCV 
protein-specific antibodies. In particular, it will be 
necessary to determine the coding capacity of RNA 4, 
which our data suggest has two ORFs in the 5’ unique 
region, and RNA 5 which appears to have an unusually 
long 5’ non-coding region. The availability of cDNA 
clones encompassing these genes will facilitate coupled 
transcription-translation experiments as have been 
described for MHV (Budzilowicz & Weiss, 1987). We 
expect that these studies will show that the replication 
strategy of HCV 229E closely parallels those of other 
coronaviruses. 

At the moment we are not able to judge the relevance 
of the RNA 3 species which is detected by our 
hybridization probes and has been previously identified 
as a virus-specific RNA by metabolic labelling in the 
presence of actinomycin D (Schreiber et al., 1989). It is 
not clear whether the RNA should be considered a 
putative mRNA or whether it represents, for example, 
an intracellular defective RNA or even a replicative 
form component. We hope to be able to resolve this 
question by sequence analysis of a cDNA corresponding 
to this RNA. 

In addition to the S and M glycoproteins, MHV JHM,
BCV and HCV OC43 possess a third glycoprotein, HE,
which has both receptor-destroying and receptor-binding
activities (M. Pfleiderer & S. Siddell, unpublished;
Vlasak et al., 1988). For MHV JHM and BCV, the gene
encoding this protein is located immediately upstream of
the S protein gene (Parker et al., 1989; Shieh et al., 1989).
In the course of these studies, we have sequenced
approximately 0.15 kb upstream of the HCV 229E S gene
and our analysis revealed an ORF, the deduced amino
acid sequence of which displays a high homology with
the carboxy terminus of the IBV gene F (polymerase)
product (data not shown) (Boursnell et al., 1987). Taken
together with the fact that HCV 229E does not have a
receptor-binding (haemagglutinating) activity
(Hierholzer, 1976), and our Northern blot analysis which did
not reveal any additional RNAs between RNA 1 and
RNA 2, these data strongly suggest that the HCV 229E
genome does not contain a haemagglutinin-esterase
gene.

In this paper we have proposed a model for the 
organization and expression of the HCV 229E genome 
and presented the predicted amino acid sequence of the 
spike glycoprotein. These data provide an essential basis 
to investigate the replication of the virus, as well as the 
structure, function, immunological and biological
properties of the S protein. These studies will undoubtedly
be important for our understanding of the pathogenesis 
and epidemiology of a widespread human infection. 


We would like to thank Dr S. Myint for providing the plasmid 
pSMF1 and Helga Kriesinger for typing the manuscript. This work was
supported by Sonderforschungsbereich 165, B1. The sequence data 
presented in this paper will appear in the EMBL/GenBank/DDBJ 
Nucleotide Sequence Databases under the accession number X16816.